library(tidyverse)
# library(broom)
library(patchwork)
library(scales)
library(dagitty)
library(ggdag)
library(latex2exp) # Easily convert LaTeX into arcane plotmath expressions
library(ggtext) # Use markdown in ggplot labels
# Create a cleaner serifed theme to use throughout
theme_do_calc <- function() {
theme_dag(base_family = "Times New Roman") +
theme(plot.title = element_text(size = rel(1.5)),
plot.subtitle = element_markdown())
}
# Make all geom_dag_text() layers use these settings automatically
update_geom_defaults(ggdag:::GeomDagText, list(family = "Times New Roman",
fontface = "bold",
color = "black"))
In all the examples that follow, we are trying to estimate the causal
relationship between x and y; that is, the
interventional distribution \(P(Y \mid \mbox{do}(X = x))\).
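As a minimal sketch of what identifying \(P(Y \mid \mbox{do}(X = x))\) involves, dagitty can report which covariates must be adjusted for. The DAG below is a hypothetical three-variable example, not one of the graphs used later:

```r
library(dagitty)

# Hypothetical DAG: z confounds the x -> y relationship
g <- dagitty("dag { z -> x ; z -> y ; x -> y }")

# Which variable sets identify the causal effect of x on y?
# For this graph the single minimal set should be { z }
adjustmentSets(g, exposure = "x", outcome = "y")
```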
This represents a sequential path from intervention to outcome via an
intermediate variable, or mediator (x → z → y). Conditioning
on the mediator can block the causal link between x and y.
chain_dag <- dagify(z ~ x,
y ~ z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(chain_dag) +
theme_dag()
Independence: x and y are independent
(conditioned on z)
impliedConditionalIndependencies(chain_dag)
## x _||_ y | z
Adjustment set: shows what adjustment is necessary to obtain the causal link between input and outcome
ggdag_adjustment_set(chain_dag) +
theme_dag()
Separation: unconditionally, x and y are d-connected;
conditioning on z d-separates them
ggdag_dseparated(chain_dag) +
theme_dag()
ggdag_dseparated(chain_dag, controlling_for = "z") +
theme_dag()
Simulated data - Equations
x: \(N(5, 1)\)
z ~ x: \(\beta_1 = 4\)
y ~ z: \(\beta_2 = 2\)
n <- 1000
x <- rnorm(n, 5, 1)
z <- x * 4 + rnorm(n, 0, 2)
y <- z * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
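The implied independence can also be checked against the simulated data with dagitty’s localTests() (a sketch; the default test is based on partial correlations, which suits this linear Gaussian simulation):

```r
# Test each conditional independence implied by the DAG against df;
# for x _||_ y | z the estimated partial correlation should be near zero
localTests(chain_dag, data = df)
```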
Model of y ~ x. Regressing y on x alone
recovers the total effect of x on y transmitted
through the mediator (\(\beta_1 \beta_2 = 8\)).
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2218 -3.0478 0.0459 2.9456 12.5710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3674 0.7140 -1.915 0.0558 .
## x 8.2846 0.1408 58.843 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.435 on 998 degrees of freedom
## Multiple R-squared: 0.7763, Adjusted R-squared: 0.776
## F-statistic: 3462 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. Conditioning on the mediator blocks
the causal path from x to y, so the apparent
relationship disappears. Note that the coefficient between z and
y is correct.
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8294 -1.4598 -0.0431 1.4187 7.6018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.75773 0.33720 -2.247 0.0249 *
## x 0.32995 0.15031 2.195 0.0284 *
## z 1.95797 0.03318 59.005 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.093 on 997 degrees of freedom
## Multiple R-squared: 0.9502, Adjusted R-squared: 0.9501
## F-statistic: 9510 on 2 and 997 DF, p-value: < 2.2e-16
chain_dag <- dagify(z ~ x,
y ~ x + z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(chain_dag) +
theme_dag()
Independence: with the direct x → y edge added, the DAG
implies no conditional independencies, so the call below returns nothing
impliedConditionalIndependencies(chain_dag)
Adjustment set: shows what adjustment is necessary to obtain the causal link between input and outcome
ggdag_adjustment_set(chain_dag) +
theme_dag()
Simulated data - Equations
x: \(N(10, 5)\)
z ~ x: \(\beta_1 = 5\)
y ~ x + z: \(\beta_2 = -1.5\), \(\beta_3 = 2\)
n <- 1000
x <- rnorm(n, 10, 5)
z <- x * 5 + rnorm(n, 0, 2)
y <- x * -1.5 + z * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x. Ignoring the mediator yields the total
effect of x on y (\(-1.5 + 5 \times 2 = 8.5\))
rather than the direct effect.
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5126 -2.8665 -0.0556 3.1034 12.5571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26647 0.30534 0.873 0.383
## x 8.49081 0.02794 303.901 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.474 on 998 degrees of freedom
## Multiple R-squared: 0.9893, Adjusted R-squared: 0.9893
## F-statistic: 9.236e+04 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. Conditioning on the mediator blocks
the indirect path between x and y and recovers
the direct-effect coefficients from the generating equations.
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7522 -1.3014 0.0372 1.4069 6.1381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05084 0.13625 -0.373 0.709
## x -1.49681 0.15797 -9.475 <2e-16 ***
## z 2.00111 0.03155 63.422 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.995 on 997 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9979
## F-statistic: 2.343e+05 on 2 and 997 DF, p-value: < 2.2e-16
chain_dag <- dagify(z1 ~ x,
z2 ~ z1,
y ~ x + z2,
coords = list(x = c(x = 1, y = 3, z1 = 1.5, z2 = 2.5),
y = c(x = 1, y = 1, z1 = 2, z2 = 1.5)),
exposure = "x",
outcome = "y")
ggdag(chain_dag) +
theme_dag()
Independence: the implied conditional independencies involve the
mediators; x and y are not independent under any conditioning set
impliedConditionalIndependencies(chain_dag)
## x _||_ z2 | z1
## y _||_ z1 | x, z2
Adjustment set: shows what adjustment is necessary to obtain the causal link between input and outcome
ggdag_adjustment_set(chain_dag) +
theme_dag()
Simulated data - Equations
x: \(N(10, 5)\)
z1 ~ x: \(\beta_1 = 4\)
z2 ~ z1: \(\beta_2 = -2\)
y ~ x + z2: \(\beta_3 = -1.5\), \(\beta_4 = 2\)
n <- 1000
x <- rnorm(n, 10, 5)
z1 <- x * 4 + rnorm(n, 0, 2)
z2 <- z1 * -2 + rnorm(n, 0, 2)
y <- x * -1.5 + z2 * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z1, z2)
pairs(df)
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.2589 -6.3711 -0.1723 6.1312 25.3973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.43839 0.63838 -0.687 0.492
## x -17.44859 0.05827 -299.427 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.244 on 998 degrees of freedom
## Multiple R-squared: 0.989, Adjusted R-squared: 0.989
## F-statistic: 8.966e+04 on 1 and 998 DF, p-value: < 2.2e-16
fit <- lm(y ~ x + z1, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9865 -3.0350 0.2261 3.1331 14.4222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.21892 0.31814 -0.688 0.492
## x -1.42480 0.29293 -4.864 1.34e-06 ***
## z1 -4.01454 0.07303 -54.972 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.606 on 997 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.9973
## F-statistic: 1.82e+05 on 2 and 997 DF, p-value: < 2.2e-16
fit <- lm(y ~ x + z2, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8510 -1.3512 -0.0548 1.4493 7.3408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01180 0.13868 0.085 0.932
## x -1.29307 0.11448 -11.295 <2e-16 ***
## z2 2.02657 0.01427 141.993 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.008 on 997 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 9.605e+05 on 2 and 997 DF, p-value: < 2.2e-16
fit <- lm(y ~ x + z1 + z2, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z1 + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8273 -1.3515 -0.0677 1.4441 7.2973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01463 0.13867 0.106 0.916
## x -1.35956 0.12764 -10.651 <2e-16 ***
## z1 0.08286 0.07041 1.177 0.240
## z2 2.05971 0.03158 65.231 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.007 on 996 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 6.406e+05 on 3 and 996 DF, p-value: < 2.2e-16
These occur when a confounding variable affects both the intervention
(x) and the outcome (y). Forks can produce
spurious correlations: x and y will be
correlated even though neither causes the other.
This has no causal link between x and y,
and a single confounder (z)
fork_dag <- dagify(x ~ z,
y ~ z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence: x and y are independent
(conditioned on z)
impliedConditionalIndependencies(fork_dag)
## x _||_ y | z
Adjustment set: shows what adjustment is necessary to obtain the causal link between input and outcome
ggdag_adjustment_set(fork_dag) +
theme_dag()
Simulated data - Equations
z: \(N(10, 5)\)
x ~ z: \(\beta_1 = 5\)
y ~ z: \(\beta_2 = 2\)
n <- 1000
z <- rnorm(n, 10, 5)
x <- z * 5 + rnorm(n, 0, 2)
y <- z * 2 + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x. Ignoring the confounder results in a
spurious relationship between intervention and outcome
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1274 -1.4533 0.0301 1.5496 6.3088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.067120 0.153951 0.436 0.663
## x 0.398542 0.002802 142.238 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.192 on 998 degrees of freedom
## Multiple R-squared: 0.953, Adjusted R-squared: 0.9529
## F-statistic: 2.023e+04 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. This closes the backdoor path
between x and y and eliminates the apparent
relationship
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9830 -1.3471 -0.0494 1.4390 5.8065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02228 0.14122 0.158 0.875
## x -0.04192 0.03208 -1.307 0.192
## z 2.20828 0.16032 13.774 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.01 on 997 degrees of freedom
## Multiple R-squared: 0.9605, Adjusted R-squared: 0.9604
## F-statistic: 1.212e+04 on 2 and 997 DF, p-value: < 2.2e-16
This is the same graph as before, but now we include a real
relationship between x and y
fork_dag <- dagify(x ~ z,
y ~ x + z,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence: now the DAG implies no conditional
independencies between x and y
impliedConditionalIndependencies(fork_dag)
Adjustment set:
ggdag_adjustment_set(fork_dag) +
theme_dag()
Equations
z: \(N(10, 5)\)
x ~ z: \(\beta_1 = 5\)
y ~ x + z: \(\beta_2 = -4\), \(\beta_3 = 2\)
n <- 1000
z <- rnorm(n, 10, 5)
x <- z * 5 + rnorm(n, 0, 5)
y <- x * -4 + z * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x. Without controlling for the confounder,
the coefficient relating x and y is biased:
the relationship exists, but its magnitude is wrong
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6007 -3.8928 -0.1075 3.6774 16.6773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.248996 0.373976 3.34 0.00087 ***
## x -3.624424 0.006655 -544.61 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 998 degrees of freedom
## Multiple R-squared: 0.9966, Adjusted R-squared: 0.9966
## F-statistic: 2.966e+05 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. Now we get approximately the right
value for the coefficient (\(\sim -4\))
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2661 -3.6346 0.0608 3.5008 15.0201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.61806 0.36302 1.703 0.089 .
## x -3.93228 0.03199 -122.903 <2e-16 ***
## z 1.60482 0.16346 9.818 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.076 on 997 degrees of freedom
## Multiple R-squared: 0.9969, Adjusted R-squared: 0.9969
## F-statistic: 1.625e+05 on 2 and 997 DF, p-value: < 2.2e-16
This is to illustrate Pearl’s point about unnecessary confounders. This
graph has a backdoor path containing two confounding variables
(z1 and z2). In order to test a causal link
between x and y, we only need to
control for one of them, as either will block the backdoor path
fork_dag <- dagify(x ~ z1,
z2 ~ z1,
y ~ z2,
coords = list(x = c(x = 1, y = 3, z1 = 1.5, z2 = 2.5),
y = c(x = 1, y = 1, z1 = 2, z2 = 1.5)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence:
impliedConditionalIndependencies(fork_dag)
## x _||_ y | z2
## x _||_ y | z1
## x _||_ z2 | z1
## y _||_ z1 | z2
Adjustment set: Now we get two adjustment sets, as controlling for
either z1 or z2 removes the path
ggdag_adjustment_set(fork_dag) +
theme_dag()
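The two minimal adjustment sets can also be listed directly with dagitty’s adjustmentSets() (a sketch using the fork_dag object defined above):

```r
# Either { z1 } or { z2 } blocks the backdoor path x <- z1 -> z2 -> y,
# so both minimal sets are reported
adjustmentSets(fork_dag, exposure = "x", outcome = "y")
```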
Equations
z1: \(N(10, 5)\)
x ~ z1: \(\beta_1 = 5\)
y ~ z2: \(\beta_2 = 2\)
z2 ~ z1: \(\beta_3 = -4\)
n <- 1000
z1 <- rnorm(n, 10, 5)
x <- z1 * 5 + rnorm(n, 0, 5)
z2 <- z1 * -4 + rnorm(n, 0, 5)
y <- z2 * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z1, z2)
pairs(df)
Model of y ~ x (shows spurious relationship)
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.905 -8.940 -0.692 9.559 47.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.52003 0.94845 -2.657 0.00801 **
## x -1.55071 0.01683 -92.140 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.77 on 998 degrees of freedom
## Multiple R-squared: 0.8948, Adjusted R-squared: 0.8947
## F-statistic: 8490 on 1 and 998 DF, p-value: < 2.2e-16
Control for z1
fit1 <- lm(y ~ x + z1, df)
summary(fit1)
##
## Call:
## lm(formula = y ~ x + z1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.411 -7.670 0.018 7.818 36.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.42923 0.79373 0.541 0.589
## x -0.01813 0.07197 -0.252 0.801
## z1 -7.95320 0.36645 -21.703 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.36 on 997 degrees of freedom
## Multiple R-squared: 0.9286, Adjusted R-squared: 0.9284
## F-statistic: 6480 on 2 and 997 DF, p-value: < 2.2e-16
Control for z2
fit2 <- lm(y ~ x + z2, df)
summary(fit2)
##
## Call:
## lm(formula = y ~ x + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8806 -3.5402 0.1354 3.3948 13.7955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.179837 0.345638 0.52 0.603
## x -0.005798 0.019989 -0.29 0.772
## z2 1.998835 0.024626 81.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.996 on 997 degrees of freedom
## Multiple R-squared: 0.9862, Adjusted R-squared: 0.9861
## F-statistic: 3.556e+04 on 2 and 997 DF, p-value: < 2.2e-16
Control for both. This also blocks the backdoor path, but spends an extra degree of freedom on a redundant control
fit3 <- lm(y ~ x + z1 + z2, df)
summary(fit3)
##
## Call:
## lm(formula = y ~ x + z1 + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0223 -3.4918 0.1094 3.3704 13.7611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13739 0.34927 0.393 0.694
## x -0.02669 0.03166 -0.843 0.399
## z1 0.17418 0.20469 0.851 0.395
## z2 2.01523 0.03127 64.451 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.997 on 996 degrees of freedom
## Multiple R-squared: 0.9862, Adjusted R-squared: 0.9861
## F-statistic: 2.37e+04 on 3 and 996 DF, p-value: < 2.2e-16
This again illustrates Pearl’s point about unnecessary confounders: a
backdoor path with two confounding variables
(z1 and z2), where controlling for either one
will block the backdoor path. This version also adds
a direct causal path between x and y
fork_dag <- dagify(x ~ z1,
z2 ~ z1,
y ~ x + z2,
coords = list(x = c(x = 1, y = 3, z1 = 1.5, z2 = 2.5),
y = c(x = 1, y = 1, z1 = 2, z2 = 1.5)),
exposure = "x",
outcome = "y")
ggdag(fork_dag) +
theme_dag()
Independence:
impliedConditionalIndependencies(fork_dag)
## x _||_ z2 | z1
## y _||_ z1 | x, z2
Adjustment set: Now we get two adjustment sets, as controlling for
either z1 or z2 removes the path
ggdag_adjustment_set(fork_dag) +
theme_dag()
Equations
z1: \(N(10, 5)\)
x ~ z1: \(\beta_1 = 5\)
y ~ x + z2: \(\beta_2 = 1.5\), \(\beta_3 = 2\)
z2 ~ z1: \(\beta_4 = -4\)
n <- 1000
z1 <- rnorm(n, 10, 5)
x <- z1 * 5 + rnorm(n, 0, 5)
z2 <- z1 * -4 + rnorm(n, 0, 5)
y <- x * 1.5 + z2 * 2 + rnorm(n, 0, 5)
df <- data.frame(x, y, z1, z2)
pairs(df)
Model of y ~ x (the coefficient is badly biased: the true
value is 1.5 but the estimate is near zero, and only weakly significant)
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.151 -9.467 0.234 9.982 41.820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.41619 1.01002 -1.402 0.161185
## x -0.06451 0.01770 -3.644 0.000282 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.96 on 998 degrees of freedom
## Multiple R-squared: 0.01313, Adjusted R-squared: 0.01214
## F-statistic: 13.28 on 1 and 998 DF, p-value: 0.0002825
Control for z1
fit1 <- lm(y ~ x + z1, df)
summary(fit1)
##
## Call:
## lm(formula = y ~ x + z1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.369 -7.542 -0.112 7.463 33.935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.55144 0.86353 1.797 0.0727 .
## x 1.47943 0.07779 19.018 <2e-16 ***
## z1 -7.99497 0.39536 -20.222 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.76 on 997 degrees of freedom
## Multiple R-squared: 0.3002, Adjusted R-squared: 0.2988
## F-statistic: 213.8 on 2 and 997 DF, p-value: < 2.2e-16
Control for z2
fit2 <- lm(y ~ x + z2, df)
summary(fit2)
##
## Call:
## lm(formula = y ~ x + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1352 -3.3773 -0.0024 3.1261 17.7220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.15596 0.36999 -0.422 0.673
## x 1.52706 0.02084 73.259 <2e-16 ***
## z2 2.02683 0.02523 80.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.11 on 997 degrees of freedom
## Multiple R-squared: 0.8679, Adjusted R-squared: 0.8677
## F-statistic: 3276 on 2 and 997 DF, p-value: < 2.2e-16
Control for both. This also blocks the backdoor path, but spends an extra degree of freedom on a redundant control
fit3 <- lm(y ~ x + z1 + z2, df)
summary(fit3)
##
## Call:
## lm(formula = y ~ x + z1 + z2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1363 -3.3903 -0.0061 3.1655 17.6646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.19812 0.37618 -0.527 0.599
## x 1.51040 0.03381 44.678 <2e-16 ***
## z1 0.13272 0.21197 0.626 0.531
## z2 2.03825 0.03114 65.454 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.111 on 996 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.8676
## F-statistic: 2183 on 3 and 996 DF, p-value: < 2.2e-16
This is the second type of path junction that can cause problems.
Here, our two variables (x and y) are
independent, but both cause z, the collider. The
issue with this relationship is the bias that results when we
control for the collider
collider_dag <- dagify(z ~ x,
z ~ y,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(collider_dag) +
theme_dag()
Independence: in this case, there is no conditional independence;
x and y are unconditionally independent
impliedConditionalIndependencies(collider_dag)
## x _||_ y
Adjustment set: now we do not need to adjust our model
ggdag_adjustment_set(collider_dag) +
theme_dag()
Separation: This is a test for \(d\)-separation in a DAG. \(d\)-separation indicates that two variables are independent
ggdag_dseparated(collider_dag) +
theme_dag()
We can also test for \(d\)-separation when we control for
z. This shows that conditioning on the collider opens a
noncausal path, resulting in a spurious link between x and
y
ggdag_dseparated(collider_dag, controlling_for = "z") +
theme_dag()
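The same checks can be run programmatically with dagitty’s dseparated(), which returns TRUE when the two variables are d-separated given the conditioning set (a sketch):

```r
# Marginally, x and y are d-separated (the collider blocks the path)
dseparated(collider_dag, "x", "y", list())   # TRUE

# Conditioning on the collider z opens the path
dseparated(collider_dag, "x", "y", "z")      # FALSE
```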
Equations
x: \(N(10, 2)\)
y: \(N(5, 1)\)
z ~ x + y: \(\beta_1 = 5\), \(\beta_2 = 2\)
n <- 1000
x <- rnorm(n, 10, 2)
y <- rnorm(n, 5, 1)
z <- 5 * x + 2 * y + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
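Collider bias can also be seen without any regression, by selecting on the collider (a sketch using the simulated data above; the median split is an arbitrary choice):

```r
# x and y are generated independently, so their raw correlation is near zero
cor(df$x, df$y)

# Restrict to large z: since z = 5x + 2y + noise, a large z with a small x
# forces a large y, which typically induces a negative correlation
high_z <- df$z > median(df$z)
cor(df$x[high_z], df$y[high_z])
```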
Model of y ~ x - this returns the unbiased link (or lack
thereof) between x and y given the
collider
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3652 -0.6408 -0.0312 0.6626 3.6259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.78428 0.16522 28.958 <2e-16 ***
## x 0.02307 0.01608 1.434 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.017 on 998 degrees of freedom
## Multiple R-squared: 0.002057, Adjusted R-squared: 0.001058
## F-statistic: 2.058 on 1 and 998 DF, p-value: 0.1518
Model controlling for z. This now results in a spurious
relationship between x and y
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.76946 -0.45965 0.00578 0.48680 2.17194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.595289 0.132881 19.53 <2e-16 ***
## x -1.239743 0.040190 -30.85 <2e-16 ***
## z 0.246690 0.007541 32.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7069 on 997 degrees of freedom
## Multiple R-squared: 0.5187, Adjusted R-squared: 0.5177
## F-statistic: 537.1 on 2 and 997 DF, p-value: < 2.2e-16
collider_dag <- dagify(z ~ x,
z ~ y,
y ~ x,
coords = list(x = c(x = 1, y = 3, z = 2),
y = c(x = 1, y = 1, z = 2)),
exposure = "x",
outcome = "y")
ggdag(collider_dag) +
theme_dag()
Independence: with the direct x → y edge added, the DAG
implies no independencies, so the call below returns nothing
impliedConditionalIndependencies(collider_dag)
Adjustment set: now we do not need to adjust our model
ggdag_adjustment_set(collider_dag) +
theme_dag()
Separation: This is a test for \(d\)-separation in a DAG. \(d\)-separation indicates that two variables are independent
ggdag_dseparated(collider_dag) +
theme_dag()
We can also test for \(d\)-separation when we control for
z. This shows that conditioning on the collider opens a
noncausal path, biasing the estimated link between x and
y
ggdag_dseparated(collider_dag, controlling_for = "z") +
theme_dag()
Equations
x: \(N(10, 2)\)
y ~ x: \(N(5, 1) + \beta_3 x\), \(\beta_3 = 1.5\)
z ~ x + y: \(\beta_1 = 5\), \(\beta_2 = 2\)
n <- 1000
x <- rnorm(n, 10, 2)
y <- rnorm(n, 5, 1) + 1.5 * x
z <- 5 * x + 2 * y + rnorm(n, 0, 2)
df <- data.frame(x, y, z)
pairs(df)
Model of y ~ x - this returns the unbiased coefficient
between x and y given the collider
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2137 -0.6293 0.0056 0.6603 2.6075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.97442 0.15626 31.83 <2e-16 ***
## x 1.50256 0.01532 98.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9451 on 998 degrees of freedom
## Multiple R-squared: 0.906, Adjusted R-squared: 0.9059
## F-statistic: 9618 on 1 and 998 DF, p-value: < 2.2e-16
Model controlling for z. This now results in a biased
coefficient between x and y
fit <- lm(y ~ x + z, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.21038 -0.46818 0.04043 0.44992 1.98906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.621296 0.137928 19.005 < 2e-16 ***
## x -0.402075 0.064518 -6.232 6.79e-10 ***
## z 0.237708 0.007932 29.969 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6858 on 997 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9504
## F-statistic: 9581 on 2 and 997 DF, p-value: < 2.2e-16